Mrs. Jones, the PM in charge of this project, asks you to analyze this survey data, come up with actionable insights, and, if there is monetization potential, build a product that makes use of those insights to optimize the corresponding revenue stream. Afterwards, you are to report to the stakeholders (Mrs. Jones' team) and argue why this project is worth the CMO's time and resources.
According to Mr. Patel, the PM involved in the development and deployment of the questionnaires, the people replying to them were a typical sub-group of the users of the company's site. However, some of them didn't reply to all of the questions, while the background information of some of the users is incomplete. Josie, the analyst who gathered all the data, says that this shouldn't be an issue since the sample is large enough (and quite representative of the whole population). She went on to save the data in a .csv file and sent it to you with a comment that all missing values are marked as "-".
Best of luck to you in your analysis, and may the Force be with you!
path = pwd() # get the current path for use later on
data_location = "d:\\data"
if path != data_location
cd(data_location)
end
data = readdlm("CaffeineForTheForce.csv", ',')
Hm, it looks like there are some missing values in the MoviesWatched variable. Let's see how big this issue is across the dataset.
N, n = size(data)
var_names = data[1,:]
data = data[2:end,:]
N -= 1 # no need to count the header as a data point
for i = 1:n
println(var_names[i], '\t', sum(data[:,i] .== "-"))
end
In order to handle the various variable types more effectively, let's put them all in a data frame.
using DataFrames # need to load the corresponding package first. Note that we could have loaded the data using the readtable() command from this package, if we wanted.
df = DataFrame(data)
for i = 1:n
cn = symbol(string("x", i))
nn = symbol(var_names[i])
rename!(df, cn, nn)
end
head(df)
Looks like we have a few outliers as well (cases 3, 4, and possibly more). Of the two extreme users visible in the above preview, the former has watched 44 movies (even though she is still a toddler!), while the latter has spent more than 10x as much money as the next highest spender. The Force must be strong with these two! Alternatively, these cases could be just typos (in the first case that's almost certain, since at the time of this writing there are only 7 movies out there). Either way, since Jedis are not common among the company's clientele, it would be best to deal with these outliers by either removing them or substituting their values with something that makes more sense. In the meantime, we can use the median instead of the mean for handling the missing values of the numeric features (since this metric is not influenced by outliers as much).
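To see why the median is the safer choice here, consider a small made-up sample with one extreme value (these numbers are for illustration only; they are not taken from the survey, and the snippet uses present-day Julia syntax):

```julia
using Statistics  # provides mean() and median() in recent Julia versions

x = [3, 5, 6, 4, 7, 2, 44]  # hypothetical MoviesWatched values, with one outlier

println(mean(x))    # dragged up to about 10.1 by the outlier
println(median(x))  # stays at 5.0, barely affected by it
```

A single extreme value shifts the mean by several units, while the median stays at a representative 5.0, which is exactly why we prefer it for imputation here.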
However, in order to do the missing value handling more efficiently, it would make sense to settle on a data representation at this point. From the above preview we can deduce that the variables of this dataset can be encoded as Int64, Int64, AbstractString, Int64, AbstractString, AbstractString, Float64, and AbstractString, at least for the time being. Since Julia doesn't want to make any assumptions about the data types, everything is currently encoded as the most generic data type, Any. Also, even though we won't be using the "ID" variable as a feature, it is useful to encode it as an integer, since we may need to reference it.
To make this whole type conversion process simpler, let's put all the desired data types in a single array first.
dt = [Int64, Int64, AbstractString, Int64, AbstractString, AbstractString, Float64, AbstractString]
As most of our variables have missing values, it would be best to convert them afterwards. In the meantime we can work with the target variables, MoneySpent and FavoriteProduct, as well as the ID variable.
df[:MoneySpent] = convert(Array{dt[7],1}, df[:MoneySpent])
df[:FavoriteProduct] = convert(Array{dt[8],1}, df[:FavoriteProduct])
df[:ID] = convert(Array{dt[1],1}, df[:ID])
Since our main task with this data is to classify each case into a product that they like (the FavoriteProduct variable), it makes sense to look at each variable on the basis of each class and use that information to fill in the missing values.
Q = unique(df[:FavoriteProduct])
m = length(Q)
println(Q, '\n', m)
C = Array(Array{Int64,1}, m) # initialize class array (an array of arrays), one entry per class
for i = 1:m
C[i] = collect(1:N)[df[:FavoriteProduct] .== Q[i]] # indexes of class i
end
function FindNonMissingElements(X::Array{Any,1}, ind::Array{Int64,1})
# finds all non-missing elements of array X that have an index belonging to ind
Z = []
for x in X[ind]
if x != "-"
push!(Z, x)
end
end
return Z
end
for i = 1:N
for j = 1:5
var_name = symbol(var_names[j+1])
if df[var_name][i] == "-"
c = [1,2,3,4][Q .== df[:FavoriteProduct][i]][1] # class of current case
X = FindNonMissingElements(df[var_name], C[c]) # non-missing elements of feature j, for the particular class of case i
T = dt[j+1] # type of variable j
if T <: Number
df[var_name][i] = round(T, median(X)) # since all of our feature variables are of type Int64, it makes sense to keep them this way
else
df[var_name][i] = mode(X)
end
end
end
end
Now all our missing values should have been taken care of. Let's confirm this by taking another look at the first part of our data frame.
head(df, 10)
Looks like everything is clean. Let's now finish encoding the variables into the most appropriate types.
for i = 2:6
var_name = symbol(var_names[i])
df[var_name] = convert(Array{dt[i],1}, df[var_name])
end
typeof(df[:Age]) # check one of the variables to make sure it's of the correct type
Let's now delve deeper into our data and see what else we can do to improve its quality.
Since this dataset is quite simple, we'll tackle it using just plots, although a stats-based approach would also work.
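For reference, here is a minimal sketch of what such a stats-based approach could look like, using the median absolute deviation (MAD) as a robust dispersion measure. The sample data and the 3.5 cutoff are illustrative assumptions, not part of the case study (and the code uses present-day Julia syntax):

```julia
using Statistics  # provides median()

# Flag values whose distance from the median exceeds a multiple of the MAD.
function mad_outliers(x::AbstractVector{<:Real}, threshold::Real = 3.5)
    med = median(x)
    mad = median(abs.(x .- med))  # median absolute deviation
    return findall(abs.(x .- med) .> threshold * mad)
end

mad_outliers([3, 5, 4, 6, 2, 5, 80])  # flags only the last value
```

Because both the center and the spread are median-based, a single extreme value cannot mask itself the way it can with a mean-and-standard-deviation rule.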
using Gadfly # need to load a visualization package first
Let's start with a few histograms, examining the target variables first, since these are of the highest importance.
plot(df, x = "FavoriteProduct", Geom.histogram)
It looks like most of our users are into apps, followed closely by the book fans. Very few people seem to not care about SW products at all. How much are they spending on this stuff though? Another histogram may shed some light on this question.
plot(df, x = "MoneySpent", Geom.histogram)
So, there is someone who has spent over $10000 on SW merchandise. Let's take a closer look at this person's data.
ind = collect(1:N)[df[:MoneySpent] .> 10000]
df[ind,:]
Well, this man seems like a very big fan (for one thing, he is a moderator on our blog) and he has been on this planet for 4 decades, but it's doubtful he spent that much money (especially considering the products sold are not that expensive). Even if he did, such a data point is bound to create some issues in our model later on (where we'll try to predict the money spent based on the first five features), so it's best to deal with this outlier. One good strategy for this is to find the most similar user and copy that person's MoneySpent value. Let's start by looking at the other moderators out there (there can't be that many, since it's quite a specialized role).
ind = df[:BlogRole] .== "Moderator"
df[ind, :]
It looks like there is a very similar user (the one with ID = 7) who has watched about the same number of movies, is also a big fan of episode VI, is around the same age, is also a male moderator, and is fond of SW books too. It wouldn't be a stretch to copy this person's MoneySpent data to our outlier.
ind = collect(1:N)[df[:MoneySpent] .> 10000] # re-select just the outlier, so we don't overwrite the other moderators' values
df[ind, :MoneySpent] = df[7, :MoneySpent]
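With more candidate users in the data, this look-up could be automated as a nearest-neighbor replacement. Below is a minimal sketch on entirely made-up numeric profiles (hypothetical MoviesWatched and Age values in present-day Julia syntax, not the actual survey records):

```julia
# Hypothetical (MoviesWatched, Age) profiles for four users; user 1 is the outlier.
profiles = [7 40; 6 38; 3 22; 7 41]
spent = [10500.0, 550.0, 120.0, 600.0]  # made-up MoneySpent values
outlier = 1

# Find the most similar other user via Euclidean distance over the profiles,
# then copy that user's MoneySpent value onto the outlier.
others = setdiff(1:size(profiles, 1), [outlier])
dists = [sqrt(sum((profiles[outlier, :] .- profiles[i, :]) .^ 2)) for i in others]
nearest = others[argmin(dists)]
spent[outlier] = spent[nearest]
```

In practice we would normalize the features first (so Age doesn't dominate the distance) and handle the nominal ones separately, e.g. with a simple matching coefficient.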
Now let's take a look at the actual histogram of the MoneySpent variable.
plot(df, x = "MoneySpent", Geom.histogram)
Let us now continue our exploration through the remaining variables (our features), omitting the ID one since it is just an identifier.
plot(df, x = "MoviesWatched", Geom.histogram)
Hm. It seems that the outlier we identified earlier wasn't the only case. There is another person who has watched close to 80 movies! So, unless we are dealing with a Jedi master who is using the Jedi mind trick on us, this record is unreliable and needs to be processed. Before we do anything, let's take a closer look at these outliers (basically anyone who has watched more than 7 movies):
ind = collect(1:N)[df[:MoviesWatched] .> 7]
df[:MoviesWatched][ind]
It appears that whoever was typing this data had limited control over the motor functions of their fingers. So, a reasonable assumption is that these outliers correspond to 4 and 7 respectively. If we were not sure about what the actual values were, we could treat them as missing values (go back to Data Preparation). Also, if we had several more outliers, we should change them first and then treat the missing values anew, to ensure that the median value we are using is not thrown off by the outliers.
df[:MoviesWatched][ind] = [4, 7]
Now let's see what the actual histogram of this variable looks like.
plot(df, x = "MoviesWatched", Geom.histogram)
No surprises here. Let's look at the FavoriteEpisode variable though.
plot(df, x = "FavoriteEpisode", Geom.histogram)
No outliers here (though the fact that someone liked episode III more than all of the others is somewhat suspicious!). Interestingly, no one liked episodes I and II the most, so there is still hope for the SW fans of the years to come... What about the Age variable though?
plot(df, x = "Age", Geom.histogram)
Hm. It looks like we have a centenarian among our ranks! Moreover, that person seems to qualify for the Guinness Book of World Records. Maybe it's a Jedi Master following in the steps of Yoda... Let's take a closer look at this person.
ind = collect(1:N)[df[:Age] .> 100]
df[ind, :]
Definitely an old-timer, but it's doubtful this person is that old. It would be best to treat this case as a missing value, since it's not feasible to accurately guess their actual age (this user also belongs to a relatively common class of users). The fact that this person is into SW books (class 2) can help us provide a replacement value that won't throw off our predictions for them, particularly the favorite product part. First, let's turn their age into a missing value.
df[ind,:Age] = 999 # since we can't use the "-" symbol for Int64 variables, we'll go ahead and give it an easily distinguishable value
Before we continue though, it's best to take a look at another peculiar case: a user whose age is on the other end of the spectrum (perhaps a Jedi prodigy?).
ind = collect(1:N)[df[:Age] .< 10]
df[ind, :]
Since there is no obvious way to replace this outlier in age, it would be best to also treat that user's age as a missing value.
df[ind, :Age] = 999
ind = collect(1:N)[df[:Age] .== 999] # indexes of problematic cases (for the Age variable)
for i in ind
c = [1,2,3,4][Q .== df[:FavoriteProduct][i]][1] # class of current case
nme = setdiff(C[c], ind) # indexes of same-class cases with valid Age values
X = convert(Array, df[nme, :Age]) # non-missing Age values for the particular class of case i
df[i, :Age] = round(Int64, median(X)) # since all of our feature variables are of type Int64, it makes sense to keep them this way
end
Now let's take a look at a more realistic histogram of the Age variable, before we continue to the Gender one.
plot(df, x = "Age", Geom.histogram)
plot(df, x = "Gender", Geom.histogram)
Looks like the majority of our users are guys. Let's carry on to the next and final feature variable.
plot(df, x = "BlogRole", Geom.histogram)
Nothing spectacular here. I'm pretty sure that distribution holds across most blogs. Let's delve a bit deeper though and see what relationships we can find among all these variables.
plot(x = df[2], y = df[3]) # relationship between MoviesWatched and FavoriteEpisode
Not much to see here (which is good, since it shows that these two features are independent from each other).
plot(x = df[2], y = df[4]) # relationship between MoviesWatched and Age
Nothing interesting here either. Apparently if someone is interested in the SW saga, age doesn't affect how many movies they watch.
plot(x = df[2], y = df[5]) # relationship between MoviesWatched and Gender
Same here. Gender isn't a factor when it comes to how many SW films they are going to watch.
plot(x = df[2], y = df[6]) # relationship between MoviesWatched and BlogRole
There is a very weak signal here. It seems that if someone hasn't watched all 7 movies, they are almost certainly not a moderator.
plot(x = df[2], y = df[7]) # relationship between MoviesWatched and MoneySpent
Not much to work with here either. Perhaps a faint signal connecting the number of movies watched with how much money they spend on SW stuff. It could be a fluke though.
plot(x = df[2], y = df[8]) # relationship between MoviesWatched and FavoriteProduct
Even though this is again a quite weak signal, it may be somewhat useful. Based on this, if someone has watched 6 movies, they are unlikely to be completely disinterested in any of the SW products.
plot(x = df[3], y = df[4]) # relationship between FavoriteEpisode and Age
It looks like episode III appeals to a relatively younger audience (compared to episodes V and VI, for example). That's something worth exploring further (in the data discovery stage).
plot(x = df[3], y = df[5]) # relationship between FavoriteEpisode and Gender
Interestingly, none of our female users seemed to like episode VII the most.
plot(x = df[3], y = df[6]) # relationship between FavoriteEpisode and BlogRole
Interestingly, those who like episodes III or VII the most weren't blog moderators.
plot(x = df[3], y = df[7]) # relationship between FavoriteEpisode and MoneySpent
It seems that those who liked episode VI the most tend to have spent more on SW stuff, overall.
plot(x = df[3], y = df[8]) # relationship between FavoriteEpisode and FavoriteProduct
It appears that those who liked episodes IV or III were definitely not apathetic when it came to SW products, while those who preferred episode VII over the others were not so much into video games.
plot(x = df[4], y = df[5]) # relationship between Age and Gender
Interestingly, most of our older users are men.
plot(x = df[4], y = df[6]) # relationship between Age and BlogRole
It looks like our most active users (who are not moderators) are somewhat younger than our normal users. Also, our moderators are all middle-aged.
plot(x = df[4], y = df[7]) # relationship between Age and MoneySpent
Apparently, the high-spenders are in the middle-aged group.
plot(x = df[4], y = df[8]) # relationship between Age and FavoriteProduct
Not unexpectedly, our older users are more inclined to prefer books or nothing at all, when it comes to SW products.
plot(x = df[5], y = df[6]) # relationship between Gender and BlogRole
This is definitely not the plot we are looking for (no interesting information whatsoever). Let's move along...
plot(x = df[5], y = df[7]) # relationship between Gender and MoneySpent
It looks like our male users spend more on SW stuff, overall.
plot(x = df[5], y = df[8]) # relationship between Gender and FavoriteProduct
Not surprisingly, our female users are not so much into video games.
plot(x = df[6], y = df[7]) # relationship between BlogRole and MoneySpent
Although active users and normal users can spend anything between $0 and $1000 on SW stuff, moderators all spend about the same amount, which is also a lot compared to the other users.
plot(x = df[6], y = df[8]) # relationship between BlogRole and FavoriteProduct
It is clear that moderators are interested either in apps or books, but are definitely neither apathetic nor into video games.
plot(x = df[7], y = df[8]) # relationship between MoneySpent and FavoriteProduct
Not surprisingly, those who are not into apps, books, or video games don't spend as much (perhaps they just go to conventions or something).
Clearly, none of the variables we have is all that insightful if used as-is (quite a typical phenomenon, actually). Perhaps we can isolate the signals in them by partitioning the nominal ones...
Just from the data exploration part, we can draw a few actionable insights regarding our users' favorite products: if a user has watched 6 movies, this is a somewhat useful predictor (their favorite product is not None). The same applies if their favorite episode is VII, in which case the possibility of video games being the favorite product is scratched. Also, if the user is female, their favorite product has to be something other than video games. Finally, if the user is a moderator, they are definitely into one of the 3 types of products. All these insights can be captured in various custom-made binary features, such as:
(we could create additional binary features to represent all the possibilities but before doing so, it would be best to do some more digging in the data discovery stage, otherwise we run the risk of the feature set size exploding)
df[:IsMoviesWatched6] = (df[:MoviesWatched] .== 6)
df[:IsFavoriteEpisode7] = (df[:FavoriteEpisode] .== "VII")
df[:IsFemale] = (df[:Gender] .== "F")
df[:IsModerator] = (df[:BlogRole] .== "Moderator")
head(df)
We can create an index array for these new features, along with the Age one, for easier access to them.
F = ["Age", "IsMoviesWatched6", "IsFemale", "IsModerator", "IsFavoriteEpisode7"]
F = map(symbol, F)
Now it seems that we are ready for the next stage...
First, let's make use of a quite handy similarity metric (which is still not widely used), the symmetric similarity index. Just like all similarity metrics, it takes values between 0 and 1, with higher values denoting more similarity. It is symmetric in the sense that it captures negative similarities as well, something that conventional similarity metrics fail to do. This allows you to see the actual similarity, just like the absolute value of the correlation coefficient does for numeric features.
function SSI(x::BitArray{1}, y::BitArray{1})
a = sum(!x & !y)
b = sum(!x & y)
c = sum(x & !y)
d = sum(x & y)
N = a + b + c + d
x0 = (a + d) / N
x1 = (b + c) / N
return 2*max(x0, x1) - 1
end
Although this function works for bit arrays only (for performance reasons), it's fairly easy to use on conventional binary arrays, by applying the BitArray() function to them.
SSI(BitArray(df[:IsFemale]), BitArray(df[:IsModerator]))
What about the similarity of a non-binary feature with a target? For that we can use the generalized version of SSI, which is designed to assess the predictive power of any nominal feature, for a non-binary target:
function SSI{T1 <: Any, T2 <: Any}(x::Array{T1, 1}, y::Array{T2, 1})
N = length(x)
Qx = sort(unique(x))
Qy = sort(unique(y))
cc = [length(Qx), length(Qy)]
X = BitArray(N, cc[1])
Y = BitArray(N, cc[2])
S = Array(Float64, cc[1], cc[2])
W = Array(Float64, cc[1], cc[2])
for i = 1:cc[1]
for j = 1:cc[2]
xx = (x .== Qx[i])
yy = (y .== Qy[j])
W[i,j] = sum(xx & yy)
S[i,j] = SSI(xx, yy)
end
end
W /= N
return sum(W.*S), S
end
SSI(convert(Array,df[:FavoriteEpisode]), convert(Array, df[:FavoriteProduct]))
Clearly, this feature has potential, since it can predict the favorite product being video games (class 3) or if it's none (class 4) about 70% of the time, although it is a mediocre predictor for the other two classes.
SSI(convert(Array, df[:IsFavoriteEpisode7]), convert(Array, df[:FavoriteProduct]))
SSI(convert(Array, df[:MoviesWatched]), convert(Array, df[:FavoriteProduct]))
Since several features combine some valuable aspects with others that are not that useful, it would make sense to binarize all the nominal features. Besides, this kind of representation will be useful for some classifiers.
function binarize{T <: Any}(X::Array{T, 1}, nv::Int64 = 0) # X is a nominal feature
ux = sort(unique(X))
if nv == 0; nv = length(ux); end
Y = falses(length(X), nv) # feature values
fn = Array(ASCIIString, nv) # feature names
for i = 1:nv
Y[(X .== ux[i]), i] = true
fn[i] = string("Is_", ux[i])
end
return Y, fn
end
function binarize{T <: Any}(X::Array{T, 2}) # X is a set of nominal features
N, n = size(X)
if n == 1; return binarize(vec(X), 0); end # single feature case
M = Array(Int64, n) # number of unique values for the various features in X
UX = Array(Any, n) # Unique values of X
for i = 1:n
UX[i] = sort(unique(X[:,i]))
M[i] = length(UX[i])
end
tnbf = sum(M) # Total Number of Binary Features
Y = falses(N, tnbf) # feature values
fn = Array(ASCIIString, tnbf) # feature names
c = 0 # binary feature counter
for i = 1:n # feature to be binarized
for j = 1:M[i] # value of feature i to be processed
c += 1 # binary feature
Y[(X[:,i] .== UX[i][j]), c] = true
fn[c] = string("Feat", i, "_", UX[i][j])
end
end
return Y, fn
end
non_numeric_features = convert(Array,df[[:MoviesWatched, :FavoriteEpisode, :Gender, :BlogRole]])
bdf, feat_names = binarize(non_numeric_features);
# binarized nominal features
feat_names
temp = convert(Array, df[:FavoriteProduct])
bfp, product_name = binarize(temp) # binarized favorite product
N, nbf = size(bdf) # nbf = number of binary features
ndf = DataFrame() # all numeric data frame
for i = 1:nbf
ndf[symbol(feat_names[i])] = DataArray(float(bdf[:,i]))
end
ndf[:Age] = df[:Age]
ndf[:MoneySpent] = df[:MoneySpent]
for i = 1:4
ndf[symbol(product_name[i])] = DataArray(float(bfp[:,i]))
end
head(ndf)
We could run a few more tests on the dataset and see if we can come up with other interesting variables (e.g. combos of the ones we already have), but we have a deadline to meet! So, let's save our work so far and carry on to the next stage of our analysis. To ensure that even non-Julians can access this data, we save everything in .csv format.
writetable("CaffeineForTheForce_processed.csv", df)
writetable("CaffeineForTheForce_all_numeric.csv", ndf)
ndf = readtable("CaffeineForTheForce_all_numeric.csv") # we can skip this command if we are running the script from beginning to end. It is very useful however to have it in case we need to redo, or check the data modeling part, without having to go through the data engineering bit every time.
N, n = size(ndf)
n -= 5 # adjust the number of features by subtracting the number of target variables
Let's start by loading the necessary packages (we could use additional ML tools, but these should be enough for the problem at hand).
using DecisionTree # a package with some commonly used classifiers
using BackpropNeuralNet # a package for a basic neural network
using ELM # a package for Extreme Learning Machines
using GLM # a package for Generalized Linear Models (regression)
using ROC # a great validation package for classification problems
Here we'll see how we can get the computer to classify a user based on the features we have, onto the four categories of products they may be interested in.
First, let's prepare the data for the classifiers / regressors. We are going to use a 70%-30% split, which is typical for this kind of application (although this could probably be done using a package, it's so simple that doing so would probably not save us any effort).
r = randperm(N);
tr = round(Int64, N*0.7) # number of training data points
te = N - tr # number of testing data points
train_data = df[r[1:tr],:]; # training set for our data learning experiments
test_data = df[r[(tr+1):end],:]; # testing set for our data learning experiments
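One caveat with a plain random split: with four classes of uneven sizes, chance alone can leave a class under-represented in the test set. A stratified split avoids this by sampling within each class. Here is a minimal sketch of the idea (a hypothetical helper in present-day Julia syntax, not a function from any of the packages used here):

```julia
using Random  # provides shuffle()

# Split indexes into train/test while preserving each class's proportion.
function stratified_split(labels::AbstractVector, ratio::Real = 0.7)
    train_idx = Int[]
    for q in unique(labels)
        idx = shuffle(findall(labels .== q))  # all cases of class q, in random order
        append!(train_idx, idx[1:round(Int, ratio * length(idx))])
    end
    test_idx = setdiff(1:length(labels), train_idx)
    return sort(train_idx), test_idx
end
```

For instance, with 10 "App" and 10 "Book" labels and a 0.7 ratio, both the training and the testing set keep the 50/50 class balance.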
Now it would be a good time to see if we can get the computer to learn any patterns to be able to predict the FavoriteProduct variable for various users. Let's start with one of the simplest classifiers (which is also quite popular for problems of low dimensionality like this one).
ind = [2, 3, 4, 5, 6, 9, 10, 11]
features_training = convert(Array,(train_data[:, ind]));
labels_training = convert(Array,train_data[:, 8]);
features_testing = convert(Array,(test_data[:, ind]));
labels_testing = convert(Array,test_data[:, 8]);
model1 = build_tree(labels_training, features_training); # create the decision tree
model1 = prune_tree(model1, 0.9); # cut down a few nodes to make it less prone to overfitting
pred = apply_tree(model1, features_testing)
AR = sum(labels_testing .== pred) / length(labels_testing)
This doesn't look very promising. Let's try another model along these lines...
model2 = build_forest(labels_training, features_training, 3, 20); # params = number of features in each tree, number of trees
pred = apply_forest(model2, features_testing)
AR = sum(labels_testing .== pred) / length(labels_testing)
That's better. Apparently a single tree can't create a good enough generalization but a series of them can do the trick. Let's see how a different kind of tree ensemble performs.
model3, coeffs = build_adaboost_stumps(labels_training, features_training, 10);
pred = apply_adaboost_stumps(model3, coeffs, features_testing)
AR = sum(labels_testing .== pred) / length(labels_testing)
Apparently, this doesn't help our case much. Maybe we'll have better luck with a different type of classifier, such as a neural network.
Before doing so though, we'll need to get the all-numeric version of the dataset, since neural nets can only work with numeric data.
train_data = ndf[r[1:tr],:]; # training set for our data learning experiments using all-numeric features
test_data = ndf[r[(tr+1):end],:]; # testing set for our data learning experiments using all-numeric features
n = size(train_data, 2) - 5 # you can skip this command if you have already computed the number of variables n
features_training_new = map(Float64,Array(train_data[:,1:n]))
features_testing_new = map(Float64,Array(test_data[:,1:n]))
It would be a good idea to normalize the last feature (Age), since, unlike Jedis, neural networks can get confused very easily by differences in scale.
max_ = maximum(features_training_new[:,end])
min_ = minimum(features_training_new[:,end])
features_training_new[:,end] = (features_training_new[:,end] - min_) / (max_ - min_);
features_testing_new[:,end] = (features_testing_new[:,end] - min_) / (max_ - min_);
labels_training_new = map(Float64,Array(train_data[:,(end-3):end]))
labels_testing_new = map(Float64,Array(test_data[:,(end-3):end]))
println(labels_training_new[1:10,:])
println(labels_testing_new[1:10,:])
net = init_network([n, round(Int64, 1.5*n), m]); # create an artificial neural net
for j = 1:2500 # train for 2500 epochs
if mod(j,500) == 0; println(j); end # this is just to show how the training progresses
for i = 1:size(features_training_new, 1)
a = features_training_new[i,:]
b = labels_training_new[i,:]
train(net, a[:], b[:])
end
end
nn = size(features_testing_new, 1)
pred_labels = Array(Float64, nn, m)
for i = 1:nn
pred_labels[i,:] = net_eval(net, features_testing_new[i,:][:])
end
function ANN2vector{T <: Any}(Z::Array{Float64, 2}, Q::Array{T,1})
# turn a matrix output from an ANN into a vector output
N, noc = size(Z)
z = Array(T, N)
p = Array(Float64, N)
for i = 1:N
p[i], ind = findmax(Z[i,:])
z[i] = Q[ind]
end
return z, p
end
QQ = sort(Array(Q))
pred, prob = ANN2vector(pred_labels, QQ)
AR = sum(labels_testing .== pred) / length(labels_testing)
Hmm. Perhaps there is a limit to the performance we can get with this dataset. Let's see if ELMs can give us a better prediction.
targets_training = ones(tr)
targets_testing = ones(te)
ind_book = (labels_training .== "Book")
ind_none = (labels_training .== "None")
ind_game = (labels_training .== "VideoGame")
targets_training[ind_book] = 2.0
targets_training[ind_none] = 3.0
targets_training[ind_game] = 4.0
ind_book = (labels_testing .== "Book")
ind_none = (labels_testing .== "None")
ind_game = (labels_testing .== "VideoGame")
targets_testing[ind_book] = 2.0
targets_testing[ind_none] = 3.0
targets_testing[ind_game] = 4.0
elm = ExtremeLearningMachine(35);
ELM.fit!(elm, features_training_new, targets_training)
outputs = ELM.predict(elm, features_testing_new)
Clearly, we'll need to adjust the outputs so that they can be mapped to the 4 classes we have.
outputs[outputs .< 1.0] = 1.0
outputs[outputs .> 4.0] = 4.0
pred = QQ[round(Int64, outputs)]
AR = sum(labels_testing .== pred) / length(labels_testing)
Time to evaluate the results in more depth. In this part we'll be using the ROC curve, which is a very robust validation strategy. Since only some of the classifiers performed adequately, we'll focus on just those (the random forest and the neural network).
prob = apply_forest_proba(model2, features_testing, QQ)
p = maximum(prob, 2)
p = p[:]; # make probabilities Array 1-dimensional
z = (apply_forest(model2, features_testing) .== labels_testing)
rc1 = ROC.roc(p, z)
ROC.plot(rc1)
AUC(rc1)
pred, prob = ANN2vector(pred_labels, QQ)
z = (pred .== labels_testing);
rc2 = ROC.roc(prob, z)
ROC.plot(rc2)
AUC(rc2)
Based on all of the above analysis, we can conclude that the random forest classifier works best for this particular problem.
Let's now do the same for the prediction of the MoneySpent variable.
join(names(train_data), " + ")
reg_model1 = GLM.fit(LinearModel, MoneySpent ~ Feat1_3 + Feat1_4 + Feat1_5 + Feat1_6 + Feat1_7 + Feat2_III + Feat2_IV + Feat2_V + Feat2_VI + Feat2_VII + Feat3_F + Feat3_M + Feat4_ActiveUser + Feat4_Moderator + Feat4_NormalUser + Age, train_data)
Most of the factors in our regression model fail to pass the significance test, so they'll have to go. The Feature 2 factors seem to be significant, even though they have very high coefficients, so let's look into them more.
reg_model2 = GLM.fit(LinearModel, MoneySpent ~ Feat2_III + Feat2_IV + Feat2_V + Feat2_VI + Feat2_VII + Age, train_data)
Clearly these aren't the factors we are looking for. Let's move along now, to an even simpler model...
reg_model3 = GLM.fit(LinearModel, MoneySpent ~ Age, train_data)
w = coef(reg_model3) # weights
reg_pred = test_data[:Age] * w[2] + w[1]
Time to evaluate the results. In this section we'll use the mean squared error (MSE) metric, which is the most popular way to assess a regression model.
MSE(t::DataArray{Float64, 1}, y::DataArray{Float64, 1}) = mean((t-y).^2)
MSE(test_data[:MoneySpent], reg_pred)
Based on the above, we can conclude that the Age feature is the only one that's actually relevant in predicting the amount of money spent by one of our members.
For this stage, there is not much we can do in Julia. However, we decide to use a GUI development tool that makes use of our Julia model on the back end. The product becomes a success.
After the 1-on-1 meeting with Mrs. Jones, it became apparent that one thing that's very useful for her, as well as for the other stakeholders in this project, is the value of the different variables used in our models. This can help her make important decisions about how we will organize the next release of this project and provide the marketing team with the requirements for the next version of the questionnaire. So, without wasting too much time, we set off to deliver this useful insight in a quite visual way.
Z = DataFrame(Any, length(ind), 2)
c = 0
for i in ind
c += 1
Z[c, 1] = names(df)[i]
Z[c, 2] = SSI(convert(Array,df[:,i]), convert(Array, df[:FavoriteProduct]))[1]
println(Z[c, 1], ": ", round(Z[c, 2], 3)[1])
end
rename!(Z, symbol("x1"), symbol("Feature"));
rename!(Z, symbol("x2"), symbol("Favorite_Product_Similarity"));
visual = plot(Z, x="Feature", y="Favorite_Product_Similarity", Geom.bar, Scale.y_continuous(minvalue=0.2, maxvalue=0.4))
draw(PNG("CftF final visual.png", 30cm, 15cm), visual)
So, in the future it would make sense to have a question like "Is episode VII your favorite episode?", along with the age and gender ones, while possibly adding some new questions.